Web Documents Categorization using Fuzzy Representation and HAC

Authors

  • Jiawei Deng
  • Lihui Chen
Abstract

Most of the existing techniques for characterization of Web documents are based on term-frequency analysis. In such models, given a set of documents, each document is characterized by a feature vector in a vector space. However, as Web documents written in HTML are semi-structured by means of tags, the traditional techniques that assign term weights only by frequency of occurrence may not provide satisfactory results in representing the contents of such documents. Some recent studies have shown that the fuzzy representation (FR) of WWW information based on the significance of HTML tags is an effective alternative for characterizing Web documents. In this paper, the FR, used to generate the feature vector for each Web document, and the Hierarchical Agglomerative Clustering (HAC) algorithm are applied to investigate the efficiency and effectiveness of automatic categorization of Web documents with similar contents. Experiments conducted suggest several benefits of using such an approach.
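The pipeline described in the abstract (tag-aware term weighting followed by HAC) could be sketched roughly as follows. This is a minimal illustration, not the authors' method: the tag significance values in `TAG_WEIGHTS`, the `DEFAULT_WEIGHT` fallback, the max-normalization used to obtain membership degrees, the cosine similarity, and the average-linkage merge rule are all assumptions chosen for the example, since the abstract does not specify them. The sample documents in `DOCS` are invented.

```python
import math
import re
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical tag significance values; the paper's actual weights are not given here.
TAG_WEIGHTS = {"title": 1.0, "h1": 0.8, "b": 0.6, "p": 0.3}
DEFAULT_WEIGHT = 0.1  # assumed weight for text inside any other tag


class TagWeightedTerms(HTMLParser):
    """Accumulates, per term, the significance of the tags it occurs in."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.scores = defaultdict(float)

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        # Use the most significant enclosing tag as the weight of this text.
        weight = max((TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.stack),
                     default=DEFAULT_WEIGHT)
        for term in re.findall(r"[a-z]+", data.lower()):
            self.scores[term] += weight


def fuzzy_vector(html):
    """Return {term: membership degree in [0, 1]} (assumed max-normalization)."""
    parser = TagWeightedTerms()
    parser.feed(html)
    parser.close()
    top = max(parser.scores.values(), default=0.0)
    return {t: s / top for t, s in parser.scores.items()} if top else {}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def hac(vectors, num_clusters):
    """Average-linkage agglomerative clustering over document indices."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        total = sum(cosine(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > num_clusters:
        # Merge the pair of clusters with the highest average similarity.
        x, y = max(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Invented sample documents, for illustration only.
DOCS = [
    "<html><title>Fuzzy clustering</title><p>fuzzy sets for clustering documents</p></html>",
    "<html><title>Clustering documents</title><p>hierarchical clustering of fuzzy vectors</p></html>",
    "<html><title>Football results</title><p>league scores and football news</p></html>",
    "<html><title>Football league</title><p>weekend football scores</p></html>",
]
vectors = [fuzzy_vector(d) for d in DOCS]
clusters = hac(vectors, 2)
print(clusters)
```

In this sketch a term in the title contributes more to the feature vector than the same term in body text, which is the intuition behind weighting by tag significance rather than by raw frequency alone.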


Similar resources

A Visualization Approach to Automatic Text Documents Categorization Based on HAC

The ability to visualize documents as clusters is essential. Even the best data summarization technique will mislead if its results are poorly represented or visualized. As proposed in many studies, clustering techniques are applied and the results are produced when documents are grouped in clusters. However, in some cases, the user may want to know t...


A General Fuzzy-Based Framework for Text Representation and Its Application to Text Categorization

In this paper we develop a general framework for text representation based on fuzzy set theory. This work extends our original ideas [5],[4], in which a document is represented by a set of fuzzy concepts. The importance degrees of these fuzzy concepts characterize the semantics of documents and can be calculated by a specified aggregation function of


Generating and Applying Rules for Web Documents Retrieval

Web document retrieval is very challenging due to the huge number of documents available and the difficulty of interpreting them. Both the effectiveness and the efficiency of retrieval are important. This paper presents some approaches from soft computing to improve the effectiveness of Web document retrieval. These approaches give a more accurate and reasonable representation of terms provided by the...


Conceptual matching in web search using FIS-CRM for representing documents

In this paper, a new approach for achieving conceptual matching between user queries and web documents is presented. The key to the proposed system is the use of FIS-CRM (Fuzzy Interrelations and Synonymy Concept Representation Model) to represent the indexed web pages. This model (also implemented in the FISS metasearcher) is supported by a fuzzy synonymy dictionary and various thematic fuzzy o...


A Novel Approach for Web Document Classification

The web is a huge repository of information, and web documents need to be categorized to facilitate their search and retrieval. Web document classification plays an important role in information organization and retrieval. This paper presents a fuzzy-set-based approach for automatically classifying web documents into one of the classes represented by a set of training documen...




Publication date: 2000